library(tidyverse) # for graphing and data cleaning
library(gardenR) # for Lisa's garden data
library(lubridate) # for date manipulation
library(ggthemes) # for even more plotting themes
library(geofacet) # for special faceting with US map layout
theme_set(theme_minimal()) # My favorite ggplot() theme :)
# Lisa's garden data
data("garden_harvest")
# Seeds/plants (and other garden supply) costs
data("garden_spending")
# Planting dates and locations
data("garden_planting")
# Tidy Tuesday data
kids <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-15/kids.csv')
Before starting your assignment, you need to get yourself set up on GitHub and make sure GitHub is connected to R Studio. To do that, you should read the instruction (through the “Cloning a repo” section) and watch the video here. Then, do the following (if you get stuck on a step, don’t worry, I will help! You can always get started on the homework and we can figure out the GitHub piece later):
keep_md: TRUE in the YAML heading. The .md file is a markdown (NOT R Markdown) file that is an interim step to creating the html file. They are displayed fairly nicely in GitHub, so we want to keep it and look at it there. Click the boxes next to these two files, commit changes (remember to include a commit message), and push them (green up arrow).Put your name at the top of the document.
For ALL graphs, you should include appropriate labels.
Feel free to change the default theme, which I currently have set to theme_minimal().
Use good coding practice. Read the short sections on good code with pipes and ggplot2. This is part of your grade!
When you are finished with ALL the exercises, uncomment the options at the top so your document looks nicer. Don’t do it before then, or else you might miss some important warnings and messages.
These exercises will reiterate what you learned in the “Expanding the data wrangling toolkit” tutorial. If you haven’t gone through the tutorial yet, you should do that first.
garden_harvest data to find the total harvest weight in pounds for each vegetable and day of week (HINT: use the wday() function from lubridate). Display the results so that the vegetables are rows but the days of the week are columns.garden_harvest %>%
mutate(day = wday(date, label = TRUE)) %>%
group_by(vegetable, day) %>%
summarize(tot_harvest = sum(weight)*0.00220462) %>%
pivot_wider(id_cols = c("vegetable", "tot_harvest"),
names_from = day,
values_from = tot_harvest)
garden_harvest data to find the total harvest in pound for each vegetable variety and then try adding the plot from the garden_planting table. This will not turn out perfectly. What is the problem? How might you fix it?garden_harvest %>%
group_by(vegetable, variety) %>%
summarize(tot_harvest = sum(weight)*0.00220462) %>%
left_join(garden_planting %>% select(plot, vegetable, variety),
by = c("vegetable", "variety"))
Response: There are two problems with this join. One is that it is missing several values for plot, which is likely because those vegetables were not planted on purpose that year (for example beet leaves), so they weren’t recorded in garden_planting. To fix this you could replace this NA with something like “previously planted”. Another problem is that some varieties are separated into several rows when you planted on vegetable in different plots. This could be fixed if you had collected data on which plots you harvested from. Now that you’re working with imperfect data, you have to choose what to eliminate. Perhaps adding code that only keeps the first plot that a variety is planted in.
garden_harvest and garden_spending datasets, along with data from somewhere like this to answer this question. You can answer this in words, referencing various join functions. You don’t need R code but could provide some if it’s helpful.Response: In the garden_spending data set, compute the amount of money you spent on seeds for each vegetable by adding a new variable “total_cost”. In the garden_harvest data set, compute the total lbs harvested for each vegetable and add a variable with data from the grocery store about the cost of a lb of each vegetable. Then summarize both data sets to include only the new variables and vegetable and left_join them by vegetable (so as to keep harvest data for things you didn’t record planting) to produce a single data set with the variables “vegetable”, “total_cost”, “total_harvest_lbs”, and “grocery_cost_perlb”. Add a new column to this data set called “savings” by mutate(savings=((total_harvest_lbs*grocery_cost_perlb)-total_cost)).
garden_harvest %>%
filter(vegetable == "tomatoes") %>%
mutate(variety = fct_reorder(variety, date, min, .desc = TRUE)) %>%
group_by(variety) %>%
summarize(tot_harvest_lbs = sum(weight)*0.00220462, date) %>%
ggplot(aes(x = tot_harvest_lbs, y = variety)) +
geom_col() +
labs(title = "Total Harvest of Tomato Varieties in Order of First Harvest",
x = "Total Harvest (lbs)",
y = "Later First Harvest <---------------------> Earliest First Harvest")
garden_harvest data, create two new variables: one that makes the varieties lowercase and another that finds the length of the variety name. Arrange the data by vegetable and length of variety name (smallest to largest), with one row for each vegetable variety. HINT: use str_to_lower(), str_length(), and distinct().garden_harvest %>%
mutate(lowercase = str_to_lower(variety),
length = str_length(variety)) %>%
group_by(vegetable, length) %>%
summarize(vegetable, variety, lowercase, length) %>%
distinct()
garden_harvest data, find all distinct vegetable varieties that have “er” or “ar” in their name. HINT: str_detect() with an “or” statement (use the | for “or”) and distinct().garden_harvest %>%
select(vegetable, variety) %>%
distinct() %>%
arrange(vegetable, variety) %>%
mutate(has_aer = str_detect(variety, "er|ar")) %>%
filter(has_aer == TRUE)
In this activity, you’ll examine some factors that may influence the use of bicycles in a bike-renting program. The data come from Washington, DC and cover the last quarter of 2014.
{300px}
{300px}
Two data tables are available:
Trips contains records of individual rentalsStations gives the locations of the bike rental stationsHere is the code to read in the data. We do this a little differently than usualy, which is why it is included here rather than at the top of this file. To avoid repeatedly re-reading the files, start the data import chunk with {r cache = TRUE} rather than the usual {r}.
data_site <-
"https://www.macalester.edu/~dshuman1/data/112/2014-Q4-Trips-History-Data.rds"
Trips <- readRDS(gzcon(url(data_site)))
Stations<-read_csv("http://www.macalester.edu/~dshuman1/data/112/DC-Stations.csv")
NOTE: The Trips data table is a random subset of 10,000 trips from the full quarterly data. Start with this small data table to develop your analysis commands. When you have this working well, you should access the full data set of more than 600,000 events by removing -Small from the name of the data_site.
It’s natural to expect that bikes are rented more at some times of day, some days of the week, some months of the year than others. The variable sdate gives the time (including the date) that the rental started. Make the following plots and interpret them:
sdate. Use geom_density().Trips %>%
ggplot(aes(x = sdate)) +
geom_density() +
labs(title = "Trend of Bike Rentals in DC, October to December 2014",
x= "2014",
y="")
Description: This density plot shows that bike renting overall was on a downward trend from October to January (maybe because of the increasingly cold weather), but there were lots of ups and downs throughout the period. There were particular lows right at the beginning of October, in late November, and late December, with the highest ridership happening in October.
mutate() with lubridate’s hour() and minute() functions to extract the hour of the day and minute within the hour from sdate. Hint: A minute is 1/60 of an hour, so create a variable where 3:30 is 3.5 and 3:45 is 3.75.Trips %>%
mutate(time = hour(sdate)+(minute(sdate)*1/60)) %>%
ggplot(aes(x = time)) +
geom_density() +
labs(title = "Trend of Start Times for Bike Rentals in DC",
x = "Time of Day (24 hour clock)",
y= "")
Description: This graph shows the distribution of start times for bike rentals throughout the day is highest in the “waking hours”, around 8am-8pm, with particular spikes around 8am and 6pm (perhaps people riding to and from work) and a small bump at both 1am and 1pm. The time with the lowest bike rentals is around 4am.
Trips %>%
mutate(day = wday(sdate, label = TRUE)) %>%
group_by(day) %>%
summarize(events = n()) %>%
ggplot(aes(x = events, y = fct_rev(day))) +
geom_col() +
labs(title = "Bike rentals in DC by Day of the Week",
x = "Rentals",
y = "")
Description: This bar graph of the number of bike rentals on each day of the week shows that ridership is highest during the weekdays, close to 100,000 rentals, while the ridership on Saturday and Sunday are around 85,000. The day with the highest ridership is Thursday, and the lowest ridership is on Sunday.
Trips %>%
mutate(time = hour(sdate)+(minute(sdate)*1/60),
day = wday(sdate, label = TRUE)) %>%
ggplot(aes(x = time)) +
geom_density() +
facet_wrap(vars(day)) +
labs(title = "Trend of Start Times for Bike Rentals in DC by Day of the Week",
x = "Time of Day (24 hour clock)",
y= "")
Description: Faceting the start time density graph for bike rental trips shows two general patterns: one for weekends and one for weekdays. On the weekends, bike rentals start mostly in the middle of the day between 9am and 5pm with a bump around 1am, maybe from people choosing to rent a bike to leave a bar or party. On weekdays, there are spikes in the morning and evening around 8am and 6pm, probably from people going to and from work.
The variable client describes whether the renter is a regular user (level Registered) or has not joined the bike-rental organization (Causal). The next set of exercises investigate whether these two different categories of users show different rental behavior and how client interacts with the patterns you found in the previous exercises.
fill aesthetic for geom_density() to the client variable. You should also set alpha = .5 for transparency and color=NA to suppress the outline of the density function.Trips %>%
mutate(time = hour(sdate)+(minute(sdate)*1/60),
day = wday(sdate, label = TRUE)) %>%
ggplot(aes(x = time, fill = client)) +
geom_density(alpha = .5, color = NA) +
facet_wrap(vars(day)) +
labs(title = "Trend of Start Times for Bike Rentals in DC by Day of the Week",
x = "Time of Day (24 hour clock)",
y= "")
position = position_stack() to geom_density(). In your opinion, is this better or worse in terms of telling a story? What are the advantages/disadvantages of each?Trips %>%
mutate(time = hour(sdate)+(minute(sdate)*1/60),
day = wday(sdate, label = TRUE)) %>%
ggplot(aes(x = time, fill = client)) +
geom_density(alpha = .5, color = NA, position = position_stack()) +
facet_wrap(vars(day)) +
labs(title = "Trend of Start Times for Bike Rentals in DC by Day of the Week",
x = "Time of Day (24 hour clock)",
y= "")
Description: I think that this graph with the stacked client densities is actually worse at telling a good story than the overlapping one about the difference in use between registered and casual bike renters. In the previous graph you can clearly see that the rentals by registered users happen in spikes before and after work on weekdays and peaks 3pm on weekends while the rentals by casual users peaks at 3pm-ish every day of the week and is much more concentrated to the middle of the day throughout the week compared to registered users. The stacked graph loses the character of whatever data set is on top, in this case the casual users, making it hard to tell where their use really peaked each day. However, the stacking graph allows you to see the overall trends of bike rentals combining causal and registered riders, which is hard to see in the overlapping graph.
position = position_stack()). Add a new variable to the dataset called weekend which will be “weekend” if the day is Saturday or Sunday and “weekday” otherwise (HINT: use the ifelse() function and the wday() function from lubridate). Then, update the graph from the previous problem by faceting on the new weekend variable.Trips %>%
mutate(time = hour(sdate)+(minute(sdate)*1/60),
weekend = ifelse(wday(sdate, label=TRUE) == c("Sat", "Sun"), "weekend", "weekday")) %>%
ggplot(aes(x=time, fill = client)) +
geom_density(alpha = .5, color = NA) +
facet_wrap(vars(weekend)) +
labs(title = "Trend of Start Times for Bike Rentals in DC on Weekends vs Weekdays",
x = "Time of Day (24 hour clock)",
y= "")
client and fill with weekday. What information does this graph tell you that the previous didn’t? Is one graph better than the other?Trips %>%
mutate(time = hour(sdate)+(minute(sdate)*1/60),
weekend = ifelse(wday(sdate, label = TRUE) == c("Sat", "Sun"), "weekend", "weekday")) %>%
ggplot(aes(x = time, fill = weekend)) +
geom_density(alpha = .5, color = NA) +
facet_wrap(vars(client)) +
labs(title = "Trend of Start Times for Bike Rentals in DC on Weekends vs Weekdays",
x = "Time of Day (24 hour clock)",
y= "",
fill= "")
Description: This graph that facets by client allows you to see more clearly that casual riders have mostly the same riding start time pattern on weekends and weekdays, which is harder to see in the previous graph. The other graph shows a better comparison of the riding habits of casual and registered riders, which is more difficult to see in this graph since they are separated. Overall however, I think both graphs communicate the same information in easy to understand ways, and I don’t find one to be significantly better than the other.
Stations to make a visualization of the total number of departures from each station in the Trips data. Use either color or size to show the variation in number of departures. We will improve this plot next week when we learn about maps!Stations %>%
summarize(name, lat, long) %>%
left_join(Trips %>% summarize(sdate, sstation),
by = c("name" = "sstation")) %>%
group_by(name) %>%
summarize(departures = n(), lat, long) %>%
distinct() %>%
ggplot(aes(x = long, y = lat, color = departures)) +
geom_point() +
labs(title = "Popularity of Departure Stations for Bike Rentals in DC by Location",
x = "Longitude",
y = "Latitude")
Stations %>%
summarize(name, lat, long) %>%
left_join(Trips %>% summarize(sdate, sstation, client),
by = c("name" = "sstation")) %>%
group_by(name) %>%
summarize(perc_casual = sum(ifelse(client == "Casual", 1, 0))/n()*100,
lat, long) %>%
distinct() %>%
ggplot(aes(x = long, y = lat, color = perc_casual)) +
geom_point() +
scale_color_gradient(low="lightgrey", high= "firebrick1") +
labs(title = "Percentage of Casual Users at Departure Stations for Bike Rentals in DC",
x = "Longitude",
y = "Latitude",
color = "Percent of Casual Renters")
Description: In this graph of the percentage of casual bike renters you can see that some stations have much higher casual usership, and they tend to be in a clump in the middle, with that percentage generally decreasing the further away from this area that you get. This might be a downtown area of DC. There is also an area in the Northwest that has a somewhat higher percent of casual users.
as_date(sdate) converts sdate from date-time format to date format.top_ten_date_stat<-Trips %>%
mutate(date = as_date(sdate)) %>%
group_by(sstation, date) %>%
summarize(departures = n()) %>%
arrange(desc(departures)) %>%
ungroup() %>%
slice(1:10)
print(top_ten_date_stat)
## # A tibble: 10 x 3
## sstation date departures
## <chr> <date> <int>
## 1 Lincoln Memorial 2014-10-25 386
## 2 Lincoln Memorial 2014-10-18 354
## 3 Lincoln Memorial 2014-10-26 349
## 4 Columbus Circle / Union Station 2014-10-27 345
## 5 Lincoln Memorial 2014-10-04 337
## 6 Columbus Circle / Union Station 2014-10-02 334
## 7 Columbus Circle / Union Station 2014-10-28 333
## 8 Lincoln Memorial 2014-10-12 328
## 9 Columbus Circle / Union Station 2014-10-09 327
## 10 Columbus Circle / Union Station 2014-10-08 322
Trips %>%
mutate(date = as_date(sdate)) %>%
right_join(top_ten_date_stat,
by = c("sstation", "date"))
Trips %>%
mutate(date = as_date(sdate)) %>%
right_join(top_ten_date_stat,
by = c("sstation", "date")) %>%
mutate(day = wday(date, label = TRUE)) %>%
group_by(client, day) %>%
summarize(trips_per_day = n()) %>%
mutate(perc_trips = trips_per_day/sum(trips_per_day)) %>%
pivot_wider(id_cols = c(day),
names_from = client,
values_from = perc_trips)
Interpretation: This table showing the percentage of bike rentals by casual and registered users across the days of the week on the top ten ridership days is hyper specific: it would really only be useful to answer a specific question and is not useful for drawing general conclusions about the usage of this bike rental program. It shows that (given these narrow conditions), casual riders rented bikes more on the weekend- particularly on Saturday which was over 50%, while registered bikers rented less on the weekends and more consistantly throughout the weekdays. There are a few strange things about this table, including that Friday doesn’t appear because none of the top ten ridership days were on a Friday, and also that Thursday has a high percentage at 33% for registered users (maybe there was some kind of special event on a Thursday at one point?).
DID YOU REMEMBER TO GO BACK AND CHANGE THIS SET OF EXERCISES TO THE LARGER DATASET? IF NOT, DO THAT NOW.
This problem uses the data from the Tidy Tuesday competition this week, kids. If you need to refresh your memory on the data, read about it here.
facet_geo(). The graphic won’t load below since it came from a location on my computer. So, you’ll have to reference the original html on the moodle page to see it.DID YOU REMEMBER TO UNCOMMENT THE OPTIONS AT THE TOP?